GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation
Chatzikyriakidis, Stergios, Papadakis, Dimitris, Papaioannou, Sevasti-Ioanna, Psaltaki, Erofili
We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic, and Northern Greek, while adding six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevousa Greek. The result is a dataset with a total size of 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to assess the effect of good-quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
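Fine-tuning on a corpus whose varieties differ greatly in size raises the question of how to sample across them. One common remedy is temperature-based sampling, which flattens the size distribution so small varieties are not drowned out. A minimal sketch follows; the per-variety word counts are purely illustrative placeholders, not the actual GRDD+ figures.

```python
# Temperature-based sampling over an imbalanced multi-variety corpus.
# The word counts below are ILLUSTRATIVE, not the real GRDD+ sizes.

def sampling_weights(sizes, temperature=0.7):
    """Flatten the raw size distribution: temperature=1.0 keeps
    proportions as-is; temperature -> 0 approaches uniform sampling."""
    total = sum(sizes.values())
    probs = {k: v / total for k, v in sizes.items()}
    scaled = {k: p ** temperature for k, p in probs.items()}
    norm = sum(scaled.values())
    return {k: s / norm for k, s in scaled.items()}

corpus_sizes = {  # illustrative word counts only
    "Cretan": 2_000_000,
    "Cypriot": 1_500_000,
    "Pontic": 900_000,
    "Tsakonian": 50_000,
}

weights = sampling_weights(corpus_sizes, temperature=0.5)
# Tsakonian's sampling share rises well above its ~1% share of raw words.
```

The weights can then drive per-batch sampling of training documents, trading some exposure to the largest varieties for better coverage of the smallest ones.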
Can AI mimic the human ability to define neologisms?
An ongoing and intriguing debate focuses on whether Large Language Models (LLMs) can replicate human language. The literature presents mixed evidence on this matter. Several studies suggest that LLMs can generate text closely resembling human language (Bubeck et al., 2023; Clark et al., 2021; Georgiou, 2025). However, the widely accepted concept of a universal grammar inherent in humans (Chomsky, 2000) challenges the idea that machine cognition can mirror human cognition. According to Chomsky et al. (2023), models like ChatGPT function as statistical engines driven by pattern recognition. Supporting this perspective, other studies highlight significant differences between human cognition and LLMs, which are reflected in language (Cai et al., 2024; Georgiou, 2024; Herbold et al., 2023). For instance, Georgiou (2024) examined how various linguistic components are represented in human-written and AI-generated texts, assessing the ability of ChatGPT to emulate human writing. The author found that despite AI-generated texts appearing to mimic human language, the results revealed significant differences across multiple linguistic features in the domains of phonology, grammar, and semantics.
Towards Systematic Monolingual NLP Surveys: GenA of Greek NLP
Bakagianni, Juli, Pouli, Kanella, Gavriilidou, Maria, Pavlopoulos, John
Natural Language Processing (NLP) research has traditionally been predominantly focused on English, driven by the availability of resources, the size of the research community, and market demands. Recently, there has been a noticeable shift towards multilingualism in NLP, recognizing the need for inclusivity and effectiveness across diverse languages and cultures. Monolingual surveys have the potential to complement the broader trend towards multilingualism in NLP by providing foundational insights and resources necessary for effectively addressing the linguistic diversity of global communication. However, monolingual NLP surveys are extremely rare in the literature. This study fills that gap by introducing a method for creating systematic and comprehensive monolingual NLP surveys. Characterized by a structured search protocol, it can be used to select publications and organize them through a taxonomy of NLP tasks. We include a classification of Language Resources (LRs), according to their availability, and of datasets, according to their annotation, to highlight publicly available and machine-actionable LRs. By applying our method, we conducted a systematic literature review of Greek NLP from 2012 to 2022, providing a comprehensive overview of the current state and challenges of Greek NLP research. We discuss the progress of Greek NLP and outline the Greek LRs encountered, classified by availability and usability. As we show, our proposed method helps to avoid common pitfalls, such as data leakage and contamination, and to assess language support per NLP task. We consider this systematic literature review of Greek NLP an application of our method that showcases the benefits of a monolingual NLP survey. Similar applications could address the myriad languages whose progress in NLP lags behind that of well-supported languages.
The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data
Paraskevopoulos, Georgios, Tsoukala, Chara, Katsamanis, Athanasios, Katsouros, Vassilis
The development of speech technologies for languages with limited digital representation poses significant challenges, primarily due to the scarcity of available data. This issue is exacerbated in the era of large, data-intensive models. Recent research has underscored the potential of leveraging weak supervision to augment the pool of available data. In this study, we compile an 800-hour corpus of Modern Greek from podcasts and employ Whisper large-v3 to generate silver transcriptions. This corpus is utilized to fine-tune our models, aiming to assess the efficacy of this approach in enhancing ASR performance. Our analysis spans 16 distinct podcast domains, alongside evaluations on established datasets for Modern Greek. The findings indicate consistent WER improvements, correlating with increases in both data volume and model size. Our study confirms that assembling large, weakly supervised corpora serves as a cost-effective strategy for advancing speech technologies in under-resourced languages.
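The study above reports its gains in word error rate (WER): the word-level Levenshtein distance between a reference transcript and the system hypothesis, normalized by reference length. A minimal stdlib sketch of the standard dynamic-programming computation:

```python
# Word error rate: edit distance (substitutions, insertions, deletions)
# over word tokens, divided by the reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

In practice, evaluation toolkits also normalize punctuation and casing before scoring, which matters for silver transcriptions produced by a model such as Whisper.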
OYXOY: A Modern NLP Test Suite for Modern Greek
Kogkalidis, Konstantinos, Chatzikyriakidis, Stergios, Giannikouri, Eirini Chrysovalantou, Katsouli, Vassiliki, Klironomou, Christina, Koula, Christina, Papadakis, Dimitris, Pasparaki, Thelka, Psaltaki, Erofili, Sakellariou, Efthymia, Soupiona, Hara
This paper serves as a foundational step towards the development of a linguistically motivated and technically relevant evaluation suite for Greek NLP. We initiate this endeavor by introducing four expert-verified evaluation tasks, specifically targeted at natural language inference, word sense disambiguation (through example comparison or sense selection) and metaphor detection. More than language-adapted replicas of existing tasks, we contribute two innovations which will resonate with the broader resource and evaluation community. Firstly, our inference dataset is the first of its kind, marking not just \textit{one}, but rather \textit{all} possible inference labels, accounting for possible shifts due to e.g. ambiguity or polysemy. Secondly, we demonstrate a cost-efficient method to obtain datasets for under-resourced languages. Using ChatGPT as a language-neutral parser, we transform the Dictionary of Standard Modern Greek into a structured format, from which we derive the other three tasks through simple projections. Alongside each task, we conduct experiments using currently available state of the art machinery. Our experimental baselines affirm the challenging nature of our tasks and highlight the need for expedited progress in order for the Greek NLP ecosystem to keep pace with contemporary mainstream research.
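The "simple projections" mentioned above can be pictured as turning one structured dictionary entry into task instances. The sketch below derives a sense-selection item (pick the gloss matching an example sentence) from a toy record; the entry is a hypothetical stand-in, not taken from the Dictionary of Standard Modern Greek.

```python
# Projecting a structured dictionary entry into a sense-selection task.
# The entry is a HYPOTHETICAL toy record for illustration only.

import random

entry = {
    "lemma": "bank",
    "senses": [
        {"gloss": "financial institution",
         "example": "She deposited money at the bank."},
        {"gloss": "side of a river",
         "example": "They picnicked on the river bank."},
    ],
}

def sense_selection_instance(entry, gold_index, rng=None):
    """Pair one example sentence with all candidate glosses."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    candidates = [s["gloss"] for s in entry["senses"]]
    rng.shuffle(candidates)  # hide the gold position
    gold = entry["senses"][gold_index]["gloss"]
    return {
        "context": entry["senses"][gold_index]["example"],
        "choices": candidates,
        "label": candidates.index(gold),
    }

instance = sense_selection_instance(entry, gold_index=1)
```

The same record supports the other projections as well, e.g. pairing two example sentences of the same lemma for example-comparison WSD.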
GRDD: A Dataset for Greek Dialectal NLP
Chatzikyriakidis, Stergios, Qwaider, Chatrine, Kolokousis, Ilias, Koula, Christina, Papadakis, Dimitris, Sakellariou, Efthymia
In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek: Cretan, Pontic, Northern Greek, and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and presents the first attempt to create large-scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect identification. We experiment with traditional ML algorithms, as well as simple DL architectures. The results show very good performance on the task, potentially revealing that the dialects in question have distinct enough characteristics to allow even simple ML models to perform well on the task. Error analysis is performed for the top-performing algorithms, showing that in a number of cases the errors are due to insufficient dataset cleaning.
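A classifier of the "traditional ML" kind used above can be as simple as character n-grams with Naive Bayes, which is a common strong baseline for dialect identification. The sketch below is stdlib-only; the exact features and models of the paper may differ, and any training text you feed it stands in for real dialectal data.

```python
# Minimal character n-gram Naive Bayes dialect identifier
# (Laplace-smoothed multinomial model over character trigrams).

from collections import Counter, defaultdict
import math

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NaiveBayesDialectID:
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # dialect -> n-gram counts
        self.docs = Counter()               # dialect -> training doc count

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.docs[label] += 1
            self.counts[label].update(char_ngrams(text, self.n))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best, best_lp = None, float("-inf")
        for label in self.docs:
            lp = math.log(self.docs[label] / total_docs)  # log prior
            denom = sum(self.counts[label].values()) + len(self.vocab)
            for g in char_ngrams(text, self.n):
                lp += math.log((self.counts[label][g] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Character n-grams are a natural fit here because Greek dialects differ in phonology-driven orthography (e.g. characteristic consonant clusters), which trigram counts capture without any linguistic preprocessing.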
Sample-Efficient Unsupervised Domain Adaptation of Speech Recognition Systems: A Case Study for Modern Greek
Paraskevopoulos, Georgios, Kouzelis, Theodoros, Rouvalis, Georgios, Katsamanis, Athanasios, Katsouros, Vassilis, Potamianos, Alexandros
Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of training data is limited. In this work we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained speech models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For evaluation, we collect HParl, a $120$ hour speech corpus for Greek, consisting of plenary sessions in the Greek Parliament. We merge HParl with two popular Greek corpora to create GREC-MD, a test-bed for multi-domain evaluation of Greek ASR systems. In our experiments we find that, while other Unsupervised Domain Adaptation baselines fail in this resource-constrained environment, M2DS2 yields significant improvements for cross-domain adaptation, even when only a few hours of in-domain audio are available. When we relax the problem to a weakly supervised setting, we find that independent adaptation for audio using M2DS2 and for language using simple LM augmentation techniques is particularly effective, yielding word error rates comparable to the fully supervised baselines.